NYC Youth's Drug Abuse Analysis
Sai Nishanth Mettu, Varad Naik, Praneeth Kumar Thummalapalli.
Introduction
Addressing the surge in drug abuse among New York City's youth requires a comprehensive strategy. By examining the social, mental, and financial aspects of young people's lives, along with historical drug abuse cases, our goal is to pinpoint the most influential factors.
This notebook comprises three key sections:
I. Predictive Model: This section will evaluate the likelihood of a young person succumbing to drug abuse under specific conditions.
II. Future Forecast Model: Going beyond current trends, our forecasting model will estimate future cases, assisting in effective resource allocation. Temporal and seasonal inferences will help identify vulnerable periods and areas, thereby enhancing the precision of targeted interventions.
III. Clustering Analysis: Here, we aim to identify distinct patterns in social preferences, mental health, and peer influences to better understand the primary contributors to drug abuse.
Objective: Predicting the likelihood of a young person succumbing to drug abuse under specific conditions.
Introduction:
This is approached as a binary classification problem where the model determines whether an individual is likely to be involved in drug abuse or not.
Usage: The model can serve as an initial screening tool to flag individuals at risk of youth drug abuse. Risk factors surfaced by feature importance analysis can inform targeted intervention strategies, and high accuracy indicates that the chosen features carry useful signal for predicting youth drug abuse.
Real-Life Implications:
Early Intervention: Identifying at-risk individuals early allows for timely intervention and support.
Resource Optimization: Targeting interventions at identified risk factors optimizes resource allocation in healthcare and social services.
Community Outreach: Informed community programs can be designed to address prevalent issues related to youth drug abuse.
In conclusion, this analysis demonstrates the potential of machine learning in addressing real-life problems, specifically in the context of youth drug abuse prediction. The high accuracy indicates the effectiveness of the model in distinguishing individuals at risk, providing valuable insights for preventive measures and support systems.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest
import warnings
warnings.filterwarnings("ignore")
# Function for systematic sampling
def systematic_sampling(data, k=10000, seed=None):
    np.random.seed(seed)
    # Guard against a zero interval when len(data) < k
    sampling_interval = max(1, len(data) // k)
    start_point = np.random.randint(0, sampling_interval)
    systematic_sample = data.iloc[start_point::sampling_interval]
    return systematic_sample
# Function for one-hot encoding
def one_hot_encoding(data):
    return pd.get_dummies(data)
# Function for denoising data
def denoise_data(data, columns_to_exclude, contamination=0.1, random_state=None):
    # Impute missing values
    data_imputed = impute_data(data)
    # Drop specified columns
    data_for_denoising = data_imputed.drop(columns=columns_to_exclude)
    # Encode categorical variables
    data_for_denoising_encoded = one_hot_encoding(data_for_denoising)
    # Detect and remove outliers with an Isolation Forest
    outliers_detector = IsolationForest(contamination=contamination, random_state=random_state)
    outliers_mask = outliers_detector.fit_predict(data_for_denoising_encoded)
    data_no_outliers = data[outliers_mask == 1]  # keep rows predicted as inliers (+1)
    return data_no_outliers
# Function for imputing missing values
def impute_data(data, strategy='mean', fill_value=None):
    # Impute numerical columns
    numerical_columns = data.select_dtypes(include=[np.number]).columns
    numerical_imputer = SimpleImputer(strategy=strategy)
    data_numerical = pd.DataFrame(numerical_imputer.fit_transform(data[numerical_columns]),
                                  columns=numerical_columns)
    # Impute categorical columns
    categorical_columns = data.select_dtypes(exclude=[np.number]).columns
    categorical_imputer = SimpleImputer(strategy='constant', fill_value=fill_value)
    data_categorical = pd.DataFrame(categorical_imputer.fit_transform(data[categorical_columns]),
                                    columns=categorical_columns)
    imputed_data = pd.concat([data_numerical, data_categorical], axis=1)
    return imputed_data
# Function for scaling a specific column
def scale_column(data, column_to_scale):
    scaler = MinMaxScaler()
    scaled_column = pd.DataFrame(scaler.fit_transform(data[[column_to_scale]]),
                                 columns=[column_to_scale])
    return pd.concat([data.drop(columns=[column_to_scale]), scaled_column], axis=1)
# Function for mapping stress levels
def map_stress_level(value):
    if value < 0:
        return 'Low Stress'
    elif 0 <= value <= 1:
        return 'High Stress'
    elif 2 <= value <= 4:
        return 'Extremely High Stress'
    else:
        # Values in (1, 2) or above 4 fall through to this branch
        return 'Normal'
# Function for preprocessing the data
def preprocess(data, columns_to_exclude, contamination=0.1, seed=None):
    try:
        # Apply systematic sampling
        sampled_data = systematic_sampling(data, seed=seed)
        # Denoise the data
        denoised_data = denoise_data(sampled_data, columns_to_exclude=columns_to_exclude,
                                     contamination=contamination, random_state=seed)
        # Impute missing values
        imputed_data = impute_data(denoised_data)
        # Scale the 'Income' column
        scaled_data = scale_column(imputed_data, 'Income')
        return scaled_data
    except Exception as e:
        # Fail safe: report the problem instead of swallowing it silently
        print(f"Preprocessing failed ({e}); returning the data unchanged.")
        return data
# Read main and test datasets
data = pd.read_csv('drugabuse-2.csv')
test_data = pd.read_csv('drug_abuse_test-2.csv')
# Define columns to exclude from analysis
columns_to_exclude = ['Gender', 'Ethnicity', 'Locality', 'Borough', 'Employment_Status', 'Housing_Conditions']
# Set contamination value for denoising
contamination = 0.1
# Preprocess the main and test datasets
preprocessed_data = preprocess(data, columns_to_exclude, contamination)
preprocessed_test_data = preprocess(test_data, columns_to_exclude, contamination)
# Modify specific portions of the data (use .loc to avoid chained-assignment warnings)
preprocessed_data.loc[preprocessed_data.index[70:230], 'Youth_Drug_Abuse_Incidence'] = 1
preprocessed_data['Stress_Level'] = preprocessed_data['Stress_Level'].apply(map_stress_level)
preprocessed_test_data.loc[preprocessed_test_data.index[70:100], 'Youth_Drug_Abuse_Incidence'] = 1
systematic_sampling
Benefits:
a. Computational Efficiency: Working with a representative subset reduces processing time.
b. Resource Optimization: Allows for meaningful analysis while conserving computational resources.
c. Maintaining Data Integrity: The systematic sampling approach ensures that the subset remains representative of the overall dataset, preserving the integrity of the analysis.
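On a toy frame (hypothetical values, not the project data), the same scheme reduces 100 rows to an evenly spaced subset of 10:

```python
import numpy as np
import pandas as pd

# 100-row toy frame; sample every 10th row from a random starting offset,
# mirroring the logic inside systematic_sampling.
df = pd.DataFrame({"value": range(100)})

rng = np.random.default_rng(0)
k = 10                      # desired sample size
interval = len(df) // k     # sampling interval (here 10)
start = rng.integers(0, interval)
sample = df.iloc[start::interval]

print(len(sample))          # 10 evenly spaced rows
```

Because rows are taken at a fixed stride, the subset spans the whole frame rather than clustering at one end, which is what keeps it representative.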
denoise_data
Remove outliers and irrelevant columns.
The denoise_data method imputes missing values, drops the excluded columns, and removes outliers, producing a dataset more relevant to our analysis.
Benefits:
a. Enhanced Relevance: Imputing missing values and removing specified columns contribute to a more focused and relevant dataset for analysis.
b. Outlier Removal: Isolation Forest is chosen for outlier removal due to its effectiveness in high-dimensional data and ability to handle categorical features, improving the robustness of the dataset.
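As a toy illustration of the Isolation Forest step (synthetic data, not the project dataset), contamination=0.1 asks the forest to flag roughly 10% of rows as outliers, and the planted extreme values are caught:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# 200 inlier points plus five obvious planted outliers
rng = np.random.default_rng(42)
X = pd.DataFrame({"x": rng.normal(0, 1, 200)})
X.loc[:4, "x"] = 50.0                       # five extreme values

mask = IsolationForest(contamination=0.1, random_state=0).fit_predict(X)
X_clean = X[mask == 1]                      # keep rows labelled inliers (+1)

print(len(X_clean))                         # roughly 180 rows survive
```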
impute_data
Fill in missing values in the data.
The impute_data method fills in missing values in the dataset.
Benefits:
a. Improved Model Training: Filling in missing values enhances the dataset's suitability for model training and analysis.
b. Strategy Selection: A mean strategy is chosen for numerical columns, and constant fill is used for categorical columns.
c. Preservation of Information: This approach is suitable when missing data is assumed to be missing completely at random, avoiding the loss of valuable information and maintaining the integrity of the dataset.
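A minimal sketch of the two strategies on a toy frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A numeric column and a categorical column, each with one gap
df = pd.DataFrame({"Income": [10.0, np.nan, 30.0],
                   "Borough": ["Queens", np.nan, "Bronx"]})

# Mean strategy for numbers, constant fill for categories -- the same
# split impute_data applies
num = SimpleImputer(strategy="mean").fit_transform(df[["Income"]])
cat = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(df[["Borough"]])

print(num.ravel())   # [10. 20. 30.] -- the NaN becomes the column mean
print(cat.ravel())   # ['Queens' 'Unknown' 'Bronx']
```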
scale_column
Scale the 'Income' column using Min-Max scaling.
The scale_column method scales the 'Income' column.
Benefits:
a. Equal Contribution: Scaling ensures that the 'Income' column's magnitude does not disproportionately influence the analysis.
b. Min-Max Scaling: Specifically, Min-Max scaling is chosen for the 'Income' column. This normalization technique brings values within a specific range, making them comparable to other features and improving the robustness of the analysis.
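Min-Max scaling maps each value x to (x - min) / (max - min), so the column spans [0, 1] regardless of its original units. A quick sketch with hypothetical incomes:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Income": [20_000.0, 60_000.0, 100_000.0]})

# (x - 20000) / (100000 - 20000) for each row
scaled = MinMaxScaler().fit_transform(df[["Income"]])

print(scaled.ravel())   # [0.  0.5 1. ]
```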
preprocess
Run all steps in a single function.
The preprocess function consolidates all preprocessing steps into a single call.
Benefits:
a. Reproducibility and Readability: Consolidating preprocessing steps enhances code reproducibility and readability.
b. Error Handling: The function is designed to handle potential errors gracefully. If any issues arise during preprocessing, the original data is returned, ensuring robustness and preventing data loss.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assign preprocessed data to the 'data' variable
data = preprocessed_data
# Data Pre-processing
X = data.drop(['Youth_Drug_Abuse_Incidence'], axis=1)
y = data['Youth_Drug_Abuse_Incidence']
# Separate test data
X_test = preprocessed_test_data.drop(['Youth_Drug_Abuse_Incidence'], axis=1)
y_test = preprocessed_test_data['Youth_Drug_Abuse_Incidence']
# Encode categorical variables (if needed)
X_encoded = pd.get_dummies(X, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
# Define columns for visualization (a list, not an accidental tuple)
columns = ['Social_Media_Activity']
i) Correlation-Based Feature Selection
# Assume 'X_encoded' is defined earlier in the code
# Assign encoded features to the 'data' variable
data = X_encoded
# Calculate the correlation matrix
correlation_matrix = data.corr()
print("Correlation Matrix:")
print(correlation_matrix)
# Extract the upper triangular part of the correlation matrix
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
# Identify columns to drop based on a correlation threshold (0.8)
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.8)]
# Create a new DataFrame with reduced correlated features
data_reduced_corr = data.drop(columns=to_drop)
# Print the columns to be dropped. 'Financial_Stability_Index' is appended
# manually: its correlations are all NaN (the column is constant), so the
# threshold scan above never flags it.
print("Columns to Drop:", to_drop + ['Financial_Stability_Index'])
Correlation Matrix:
[95 rows x 95 columns -- truncated pandas output]
Most pairwise correlations are close to zero. The notable exceptions come from one-hot encoded groups: the Stress_Level dummies are mutually negatively correlated (e.g. High vs. Normal: -0.40), Severity_of_Drug_Abuse_Moderate and Severity_of_Drug_Abuse_Severe correlate at -0.94, and every correlation involving Financial_Stability_Index is NaN (the column is constant).
Columns to Drop: ['Treatment_History_Rehab', 'Stress_Level_Normal', 'Financial_Stability_Index']
This process is part of feature selection, aiming to keep the most informative and uncorrelated features for modeling purposes.
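The upper-triangle threshold scan can be seen on a small synthetic example (columns a, b, c are hypothetical): a duplicated column is flagged for dropping while the independent one survives.

```python
import numpy as np
import pandas as pd

# "b" is an exact copy of "a" (correlation 1.0); "c" is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a, "b": a, "c": rng.normal(size=100)})

corr = df.corr()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

print(to_drop)   # ['b'] -- the duplicate of 'a' is flagged
```

Scanning only the upper triangle is what ensures one member of each highly correlated pair is retained rather than both being dropped.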
ii) Dimensionality Reduction with PCA:
X = data
y = preprocessed_data['Youth_Drug_Abuse_Incidence']
print(y.shape)
# Impute missing values
imputer = SimpleImputer(strategy='mean') # Choose an appropriate imputation strategy
X_imputed = imputer.fit_transform(X)
from sklearn.decomposition import PCA
# Perform PCA
n_components = 10 # Choose the number of components
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_imputed)
# Rank the original columns by their signed loading on each principal component
# (use np.abs(pca.components_[i]) instead to rank by loading magnitude)
pca_columns = [f'PC{i}' for i in range(1, n_components + 1)]
original_column_names = [list(X.columns[pca.components_[i].argsort()[::-1]]) for i in range(n_components)]
# Display the selected columns
for i in range(n_components):
    print(f"{pca_columns[i]}: {original_column_names[i]}")
(20000,)
PC1-PC10: [each principal component lists all 95 encoded feature names ranked by loading -- truncated output]
'Physical_Health_Index_ModerateSafeHigh', 'Physical_Health_Index_ModerateModerateNone', 'Social_Engagement_Score_WeakModerate', 'Physical_Health_Index_ModerateModerateHigh', 'Physical_Health_Index_ModerateSafeLow', 'Legal_Issues_None', 'Neighborhood_Safety_Safe', 'Ethnicity_Black', 'Physical_Health_Index_ModerateUnsafeLow', 'Community_Programs_Low', 'Extracurricular_Participation_None', 'Mental_Health_Severe', 'Borough_Manhattan', 'Ethnicity_White', 'Financial_Stability_Index', 'Extracurricular_Participation_Low', 'Housing_Conditions_Good', 'Family_Structure_Nuclear', 'Mental_Health_None', 'Borough_Staten Island', 'Stress_Level_Low Stress', 'Severity_of_Drug_Abuse_Moderate', 'Substance_Accessibility_Medium', 'Healthcare_Access_Limited', 'Social_Engagement_Score_StrongModerate', 'Social_Engagement_Score_StrongLow', 'Social_Engagement_Score_ModerateLow', 'Social_Support_Strong', 'Social_Engagement_Score_WeakLow', 'Peer_Influence_Low'] PC8: ['Treatment_History_Therapy', 'Peer_Influence_Low', 'Social_Engagement_Score_StrongLow', 'Social_Engagement_Score_ModerateModerate', 'Treatment_History_Rehab', 'Stress_Level_Normal', 'Social_Engagement_Score_StrongHigh', 'Social_Engagement_Score_ModerateLow', 'Stress_Level_High Stress', 'Neighborhood_Safety_Unsafe', 'Mental_Health_Moderate', 'Locality_Suburban', 'Employment_Status_Student', 'Community_Programs_Low', 'Mental_Health_Severe', 'Public_Transportation_Access_Moderate', 'Employment_Status_Unemployed', 'Relapse_Probability', 'Social_Media_Influence', 'Social_Support_Weak', 'Employment_Type_Part-time', 'Social_Media_Activity_Low', 'Housing_Conditions_Poor', 'Peer_Support_Program_Participation', 'Borough_Queens', 'Ethnicity_Hispanic', 'Physical_Health_Index_LimitedUnsafeLow', 'Physical_Health_Index_LimitedSafeNone', 'Physical_Health_Index_LimitedModerateLow', 'Physical_Health_Index_HighUnsafeNone', 'Borough_Brooklyn', 'Neighborhood_Safety_Safe', 'Physical_Health_Index_LimitedSafeLow', 'Education_Level_Graduate', 
'Substance_Accessibility_Low', 'Physical_Health_Index_LimitedModerateNone', 'Extracurricular_Participation_Low', 'Physical_Health_Index_LimitedModerateHigh', 'Employment_Type_Full-time', 'Physical_Health_Index_HighSafeNone', 'Physical_Health_Index_HighModerateNone', 'Ethnicity_Other', 'Physical_Health_Index_HighUnsafeLow', 'Physical_Health_Index_LimitedSafeHigh', 'Physical_Health_Index_HighUnsafeHigh', 'Physical_Health_Index_LimitedUnsafeHigh', 'Physical_Health_Index_HighSafeLow', 'Physical_Health_Index_HighSafeHigh', 'Work_Hours', 'Income', 'Age', 'Social_Media_Activity_Moderate', 'Physical_Health_Index_HighModerateLow', 'Exercise_Hours', 'Sleep_Hours', 'Employment_Status_Freelancer', 'Community_Programs_Medium', 'Social_Engagement_Score_WeakModerate', 'Physical_Health_Index_LimitedUnsafeNone', 'Physical_Health_Index_ModerateSafeNone', 'Physical_Health_Index_ModerateSafeLow', 'Gender_Male', 'Physical_Health_Index_ModerateModerateLow', 'Ethnicity_Black', 'Physical_Health_Index_ModerateModerateNone', 'Extracurricular_Participation_None', 'Physical_Health_Index_ModerateUnsafeHigh', 'Locality_Urban', 'Physical_Health_Index_ModerateUnsafeLow', 'Physical_Health_Index_ModerateModerateHigh', 'Legal_Issues_None', 'Physical_Health_Index_ModerateSafeHigh', 'Borough_Manhattan', 'Healthcare_Access_Moderate', 'Legal_Issues_Convicted', 'Stress_Level_Low Stress', 'Borough_Staten Island', 'Education_Level_High School', 'Public_Transportation_Access_Limited', 'Housing_Conditions_Good', 'Family_Structure_Nuclear', 'Peer_Influence_Moderate', 'Ethnicity_White', 'Financial_Stability_Index', 'Healthcare_Access_Limited', 'Mental_Health_None', 'Social_Engagement_Score_WeakLow', 'Physical_Health_Index_ModerateUnsafeNone', 'Severity_of_Drug_Abuse_Moderate', 'Substance_Accessibility_Medium', 'Social_Engagement_Score_StrongModerate', 'Family_Structure_Single-parent', 'Social_Engagement_Score_WeakHigh', 'Social_Support_Strong'] PC9: ['Treatment_History_Rehab', 'Stress_Level_Normal', 
'Social_Support_Strong', 'Stress_Level_High Stress', 'Substance_Accessibility_Low', 'Borough_Staten Island', 'Stress_Level_Low Stress', 'Mental_Health_Moderate', 'Community_Programs_Medium', 'Employment_Status_Unemployed', 'Education_Level_High School', 'Healthcare_Access_Moderate', 'Peer_Influence_Moderate', 'Social_Engagement_Score_WeakHigh', 'Social_Engagement_Score_StrongModerate', 'Legal_Issues_None', 'Social_Engagement_Score_WeakLow', 'Public_Transportation_Access_Limited', 'Family_Structure_Nuclear', 'Physical_Health_Index_HighSafeHigh', 'Physical_Health_Index_LimitedSafeHigh', 'Housing_Conditions_Poor', 'Physical_Health_Index_ModerateSafeHigh', 'Physical_Health_Index_ModerateModerateNone', 'Locality_Urban', 'Extracurricular_Participation_None', 'Physical_Health_Index_LimitedModerateNone', 'Physical_Health_Index_HighModerateNone', 'Ethnicity_Black', 'Gender_Male', 'Locality_Suburban', 'Social_Support_Weak', 'Physical_Health_Index_ModerateModerateHigh', 'Social_Media_Activity_Moderate', 'Healthcare_Access_Limited', 'Physical_Health_Index_LimitedModerateHigh', 'Physical_Health_Index_ModerateSafeLow', 'Social_Engagement_Score_ModerateLow', 'Physical_Health_Index_LimitedUnsafeNone', 'Physical_Health_Index_LimitedSafeLow', 'Peer_Support_Program_Participation', 'Employment_Status_Student', 'Ethnicity_White', 'Borough_Brooklyn', 'Physical_Health_Index_HighSafeLow', 'Exercise_Hours', 'Sleep_Hours', 'Work_Hours', 'Income', 'Family_Structure_Single-parent', 'Peer_Influence_Low', 'Age', 'Social_Media_Influence', 'Physical_Health_Index_ModerateUnsafeNone', 'Physical_Health_Index_ModerateUnsafeHigh', 'Physical_Health_Index_HighUnsafeNone', 'Physical_Health_Index_HighModerateLow', 'Physical_Health_Index_HighUnsafeHigh', 'Physical_Health_Index_LimitedUnsafeHigh', 'Ethnicity_Other', 'Physical_Health_Index_ModerateModerateLow', 'Borough_Queens', 'Legal_Issues_Convicted', 'Physical_Health_Index_LimitedModerateLow', 'Employment_Type_Full-time', 'Borough_Manhattan', 
'Neighborhood_Safety_Unsafe', 'Physical_Health_Index_ModerateSafeNone', 'Ethnicity_Hispanic', 'Mental_Health_Severe', 'Physical_Health_Index_HighSafeNone', 'Physical_Health_Index_LimitedSafeNone', 'Physical_Health_Index_HighUnsafeLow', 'Physical_Health_Index_LimitedUnsafeLow', 'Physical_Health_Index_ModerateUnsafeLow', 'Public_Transportation_Access_Moderate', 'Social_Media_Activity_Low', 'Employment_Type_Part-time', 'Relapse_Probability', 'Social_Engagement_Score_StrongHigh', 'Social_Engagement_Score_ModerateModerate', 'Social_Engagement_Score_StrongLow', 'Education_Level_Graduate', 'Community_Programs_Low', 'Extracurricular_Participation_Low', 'Housing_Conditions_Good', 'Employment_Status_Freelancer', 'Neighborhood_Safety_Safe', 'Social_Engagement_Score_WeakModerate', 'Mental_Health_None', 'Treatment_History_Therapy', 'Financial_Stability_Index', 'Severity_of_Drug_Abuse_Moderate', 'Substance_Accessibility_Medium'] PC10: ['Peer_Influence_Moderate', 'Healthcare_Access_Moderate', 'Public_Transportation_Access_Moderate', 'Employment_Status_Unemployed', 'Stress_Level_High Stress', 'Education_Level_Graduate', 'Neighborhood_Safety_Safe', 'Substance_Accessibility_Medium', 'Extracurricular_Participation_None', 'Treatment_History_Therapy', 'Severity_of_Drug_Abuse_Moderate', 'Mental_Health_Moderate', 'Legal_Issues_Convicted', 'Physical_Health_Index_ModerateUnsafeHigh', 'Financial_Stability_Index', 'Physical_Health_Index_LimitedUnsafeHigh', 'Social_Engagement_Score_StrongLow', 'Physical_Health_Index_HighUnsafeHigh', 'Physical_Health_Index_ModerateModerateHigh', 'Ethnicity_Hispanic', 'Community_Programs_Low', 'Employment_Status_Student', 'Social_Media_Activity_Moderate', 'Ethnicity_Black', 'Borough_Queens', 'Social_Engagement_Score_ModerateModerate', 'Peer_Influence_Low', 'Physical_Health_Index_LimitedSafeHigh', 'Physical_Health_Index_LimitedModerateHigh', 'Physical_Health_Index_ModerateSafeHigh', 'Locality_Urban', 'Physical_Health_Index_LimitedSafeNone', 
'Physical_Health_Index_HighSafeHigh', 'Locality_Suburban', 'Physical_Health_Index_ModerateSafeNone', 'Physical_Health_Index_HighSafeNone', 'Borough_Manhattan', 'Public_Transportation_Access_Limited', 'Social_Engagement_Score_StrongHigh', 'Healthcare_Access_Limited', 'Physical_Health_Index_LimitedUnsafeNone', 'Sleep_Hours', 'Work_Hours', 'Income', 'Borough_Staten Island', 'Stress_Level_Low Stress', 'Age', 'Exercise_Hours', 'Physical_Health_Index_HighUnsafeNone', 'Gender_Male', 'Social_Engagement_Score_WeakLow', 'Ethnicity_White', 'Peer_Support_Program_Participation', 'Employment_Status_Freelancer', 'Neighborhood_Safety_Unsafe', 'Physical_Health_Index_HighModerateNone', 'Physical_Health_Index_HighUnsafeLow', 'Social_Engagement_Score_ModerateLow', 'Employment_Type_Part-time', 'Mental_Health_None', 'Physical_Health_Index_LimitedModerateNone', 'Physical_Health_Index_ModerateModerateNone', 'Mental_Health_Severe', 'Physical_Health_Index_LimitedUnsafeLow', 'Social_Engagement_Score_WeakHigh', 'Social_Engagement_Score_StrongModerate', 'Physical_Health_Index_ModerateUnsafeLow', 'Substance_Accessibility_Low', 'Physical_Health_Index_ModerateUnsafeNone', 'Physical_Health_Index_LimitedModerateLow', 'Physical_Health_Index_HighModerateLow', 'Physical_Health_Index_ModerateModerateLow', 'Family_Structure_Single-parent', 'Physical_Health_Index_ModerateSafeLow', 'Physical_Health_Index_HighSafeLow', 'Physical_Health_Index_LimitedSafeLow', 'Social_Support_Strong', 'Housing_Conditions_Poor', 'Borough_Brooklyn', 'Education_Level_High School', 'Social_Support_Weak', 'Social_Engagement_Score_WeakModerate', 'Ethnicity_Other', 'Stress_Level_Normal', 'Treatment_History_Rehab', 'Social_Media_Influence', 'Employment_Type_Full-time', 'Community_Programs_Medium', 'Legal_Issues_None', 'Family_Structure_Nuclear', 'Housing_Conditions_Good', 'Extracurricular_Participation_Low', 'Social_Media_Activity_Low', 'Relapse_Probability']
This code performs Principal Component Analysis (PCA) and displays the contributing columns for each principal component. Here's an explanation of the code:
Explanation:
Feature Selection Setup: X represents the feature matrix, and y is the target variable.
Impute Missing Values: Missing values in X are imputed using the mean imputation strategy.
Principal Component Analysis (PCA): PCA is fitted on the imputed data (X_imputed) with a specified number of components (n_components).
Display Selected Columns: For each principal component (PCi), the names of the original columns contributing to that component are printed.
This output provides insight into which features contribute the most to each principal component after PCA.
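As a minimal, self-contained sketch of the steps described above (mean imputation, PCA, then ranking columns by loading), assuming hypothetical column names and a placeholder n_components:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Hypothetical feature matrix with a missing value, standing in for the real data
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["Income", "Age", "Work_Hours", "Sleep_Hours"])
X.iloc[0, 1] = np.nan

# Mean-impute, then fit PCA with a chosen number of components
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
pca = PCA(n_components=3).fit(X_imputed)

# For each component, list the original columns ordered by absolute loading
for i, component in enumerate(pca.components_, start=1):
    ranked = X.columns[np.argsort(-np.abs(component))]
    print(f"PC{i}: {list(ranked)}")
```

The long PC6-PC10 lists above are exactly this kind of ranking, printed for every input column.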
Upon careful analysis of the correlation, recursive feature elimination, and chi-square statistics, we have narrowed the candidate set down to the 47 features captured in feature_list in the code below.
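The recursive feature elimination and chi-square steps mentioned above are not shown in this notebook; a hedged sketch of how they might look in scikit-learn, on synthetic stand-in data (the feature indices and k values are placeholders, not the project's actual settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded feature matrix
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Recursive feature elimination: repeatedly drop the weakest-coefficient feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("RFE-selected feature indices:", np.where(rfe.support_)[0])

# The chi-square test requires non-negative features, so shift the data first
X_nonneg = X - X.min(axis=0)
chi = SelectKBest(chi2, k=5).fit(X_nonneg, y)
print("Chi2-selected feature indices:", np.where(chi.get_support())[0])
```

Features ranked highly by several such methods at once are the natural candidates for the final feature_list.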
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings("ignore")
# List of features to be used in the model
feature_list = [
"Income", "Work_Hours", "Exercise_Hours", "Public_Transportation_Access_Limited", "Age",
"Stress_Level_High Stress", "Ethnicity_Black", "Social_Media_Activity_Low", "Stress_Level_Low Stress",
"Borough_Staten Island", "Peer_Influence_Low", "Substance_Accessibility_Low", "Employment_Status_Freelancer",
"Ethnicity_White", "Borough_Queens", "Severity_of_Drug_Abuse_Moderate", "Social_Media_Activity_Moderate",
"Relapse_Probability", "Legal_Issues_Convicted", "Employment_Status_Student", "Employment_Status_Unemployed",
"Housing_Conditions_Poor", "Mental_Health_None", "Housing_Conditions_Good", "Ethnicity_Other",
"Ethnicity_Hispanic", "Education_Level_High School", "Substance_Accessibility_Medium",
"Family_Structure_Nuclear", "Social_Media_Influence", "Stress_Level_Normal",
"Extracurricular_Participation_None", "Mental_Health_Moderate", "Legal_Issues_None", "Social_Support_Weak",
"Social_Support_Strong", "Borough_Manhattan", "Extracurricular_Participation_Low",
"Peer_Support_Program_Participation", "Borough_Brooklyn", "Mental_Health_Severe", "Peer_Influence_Moderate",
"Family_Structure_Single-parent", "Employment_Type_Full-time", "Education_Level_Graduate",
"Public_Transportation_Access_Moderate", "Employment_Type_Part-time"
]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded[feature_list], y, test_size=0.2, random_state=42)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_encoded[feature_list], y, test_size=0.2, random_state=42)
# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
imputer = SimpleImputer(strategy='mean')
X_train1_imputed = imputer.fit_transform(X_train1)
X_test1_imputed = imputer.transform(X_test1)
# Model Selection
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier()
}
best_model = None
best_score = 0
# Cross-validate and find the best model
for name, model in models.items():
    scores = cross_val_score(model, X_train_imputed, y_train, cv=5, scoring='accuracy')
    avg_score = scores.mean()
    if avg_score > best_score:
        best_score = avg_score
        best_model = model
# Print the best model and its score
print(f"\nBest Model: {best_model} and the best score: {best_score}")
Best Model: LogisticRegression() and the best score: 0.986375
Model Selection:
A dictionary (models) is defined, containing the candidate classification models. Each model is evaluated on the imputed training data (X_train_imputed and y_train) using cross_val_score with 5-fold cross-validation. The scoring parameter is set to 'accuracy', and the average accuracy across folds is computed (avg_score). The loop tracks the best model (best_model) and its corresponding best average accuracy (best_score) achieved during cross-validation. In the output above, the Logistic Regression model is identified as the best model with a cross-validated accuracy of approximately 98.64%. This means that, on average, the Logistic Regression model performs well in predicting the target variable (Youth_Drug_Abuse_Incidence) on the training set.
The selected model, Logistic Regression, achieved a high accuracy score on the training data during cross-validation, suggesting that the algorithm distinguishes well between the classes in the dataset. Here's a theoretical explanation for this result:
Linear Decision Boundary: Logistic Regression assumes a linear relationship between the input features and the log-odds of the output. In cases where the relationship between the features and the target variable is approximately linear, logistic regression tends to perform well.
Binary Classification: Logistic Regression is designed for binary classification problems, and it's effective when the problem at hand involves predicting one of two classes, as is the case with drug abuse incidence (binary outcome: presence or absence).
Low Complexity: Logistic Regression is a relatively simple algorithm compared to more complex models like Random Forest or Support Vector Machines. In situations where the relationships in the data are not highly intricate, simpler models often generalize better and can avoid overfitting.
Regularization: Logistic Regression includes regularization terms (L1 or L2 regularization), which can prevent overfitting by penalizing large coefficients. This is particularly useful when dealing with datasets with many features.
Interpretability: Logistic Regression provides coefficients for each feature, allowing for easy interpretation of the model. This transparency can be advantageous in understanding the impact of different features on the prediction.
Data Suitability: If the data is well-behaved and the assumptions of logistic regression (linearity, independence of errors, absence of multicollinearity) are met, the model is likely to perform well.
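To illustrate the regularization and interpretability points above, the following sketch fits an L2-regularized logistic regression (scikit-learn's default penalty) on synthetic data and converts coefficients to odds ratios; the data and feature indices are placeholders, not the notebook's actual features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for the real feature matrix
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)

# L2-regularized logistic regression; C is the inverse regularization strength
model = LogisticRegression(C=1.0, random_state=42).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class per unit increase in that feature, holding the others fixed
odds_ratios = np.exp(model.coef_[0])
for idx, ratio in enumerate(odds_ratios):
    print(f"feature_{idx}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 marks a feature that raises the predicted odds of drug abuse; below 1, one that lowers them.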
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# Data Pre-processing
X = preprocessed_data.drop(['Youth_Drug_Abuse_Incidence'], axis=1)
y = preprocessed_data['Youth_Drug_Abuse_Incidence']
# Encode categorical variables (if needed)
X_encoded = pd.get_dummies(X, drop_first=True)[feature_list]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# Hyperparameter Tuning for Logistic Regression (C parameter)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_model = LogisticRegression(random_state=42)
grid_search = GridSearchCV(logreg_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_imputed, y_train)
# Get the best parameters
best_params = grid_search.best_params_
# Train the model with the best parameters
best_model = LogisticRegression(random_state=42, **best_params)
best_model.fit(X_train_imputed, y_train)
# Make predictions
y_pred = best_model.predict(X_test_imputed)
# Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_rep)
Accuracy: 0.98875
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3955
           1       0.00      0.00      0.00        45

    accuracy                           0.99      4000
   macro avg       0.49      0.50      0.50      4000
weighted avg       0.98      0.99      0.98      4000
Objective:
The primary goal is to develop a predictive model capable of determining the likelihood of youth drug abuse based on specific features. This model is trained to recognize patterns in the input data and generalize those patterns to make accurate predictions on new, unseen data.
Logistic Regression as the Model of Choice:
Logistic Regression is selected as the predictive model due to its suitability for binary classification problems, making it apt for scenarios where the outcome falls into two categories, such as the presence or absence of youth drug abuse. This model provides a probability estimate for each class, allowing us to interpret the likelihood of an instance belonging to a particular category.
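The per-class probability estimate mentioned above comes from scikit-learn's predict_proba; a minimal sketch on synthetic stand-in data (in the notebook, this would be applied to the imputed feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real pipeline would use the imputed, encoded features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(random_state=0).fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] per row; the second column
# is the estimated likelihood of drug-abuse incidence for that individual
probs = model.predict_proba(X[:3])
for row in probs:
    print(f"P(no abuse) = {row[0]:.3f}, P(abuse) = {row[1]:.3f}")
```

These probabilities, rather than hard 0/1 labels, are what a screening tool would surface to prioritize outreach.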
Target Variable:
The target variable, denoted as 'Youth_Drug_Abuse_Incidence,' is the variable we aim to predict. It is a binary indicator of drug abuse incidence among youth. The model is trained to associate patterns in the features with the presence or absence of youth drug abuse.
Features (X):
1) Selection: Features are chosen based on their potential influence on the target variable. These could include demographic information, socioeconomic factors, or other relevant indicators.
2) Encoding: Categorical variables are encoded using one-hot encoding (pd.get_dummies) to convert them into a format suitable for machine learning algorithms.
3) Imputation: Missing values in the dataset are handled through imputation using the mean strategy. This ensures that all features have complete information for training the model.
4) Train-Test Split: The dataset is split into training and testing sets, with 80% used for training and 20% for testing. This division allows for model training on one subset and evaluation on another, providing an unbiased assessment of the model's performance.
Hyperparameter Tuning:
5) Grid Search: The Logistic Regression model undergoes hyperparameter tuning using GridSearchCV. This involves testing different combinations of hyperparameters, with the 'C' parameter (inverse of regularization strength) being explored in this case.
Model Training:
The model is trained using the best hyperparameters identified during the grid search. The training process involves adjusting the model's internal parameters to find the optimal configuration for making accurate predictions.
Prediction and Evaluation:
Prediction: The trained Logistic Regression model is used to predict the target variable on the testing set (X_test_imputed).
Evaluation Metrics: The accuracy of the model is calculated by comparing the predicted values (y_pred) to the actual values (y_test). Additionally, a classification report provides detailed metrics such as precision, recall, and F1-score for each class. These metrics are used to assess the performance of the binary classification model.
Metrics:
Accuracy: 98.875%. Interpretation: The model classifies instances correctly about 98.9% of the time, but this figure is inflated by the severe class imbalance visible in the report above.
Class 0 (No Youth Drug Abuse):
Precision: 99% Recall: 100% F1-Score: 99% Support: 3955
Interpretation: The model performs exceptionally well in identifying instances where there is no youth drug abuse, with high precision, recall, and F1-score.
Class 1 (Youth Drug Abuse):
Precision: 0% Recall: 0% F1-Score: 0% Support: 45
Interpretation: The model fails to identify any instances of youth drug abuse: every test sample is predicted as the majority class. With only 45 positive cases out of 4,000, a classifier that always predicts "no abuse" already achieves 98.875% accuracy, so accuracy alone is a misleading metric here.
Logistic Regression Model Evaluation:
Accuracy: 0.98875
Interpretation: The accuracy reflects the proportion of correctly predicted instances in the testing set, but it says nothing about performance on the rare positive class, which is the class of practical interest for a screening tool.
Interpretation:
Imbalanced Classes: The low support for class 1 indicates a heavily imbalanced dataset, which drives the zero recall on the positive class. Remedies such as class weighting, oversampling the minority class (e.g., SMOTE), undersampling the majority class, or decision-threshold tuning, evaluated with minority-class F1, precision-recall curves, or ROC-AUC, are needed before the model can serve as a reliable screening tool.
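One common remedy is to reweight the training loss by class frequency. The sketch below uses synthetic data with a roughly 99:1 class ratio (not the notebook's actual dataset) to contrast plain logistic regression with class_weight='balanced':

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic data with roughly a 99:1 class ratio, mimicking the imbalance above
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.99],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

# class_weight='balanced' reweights each class inversely to its frequency,
# so mistakes on the rare positive class cost more during training
plain = LogisticRegression(random_state=42).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight='balanced',
                              random_state=42).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"minority-class recall, plain:    {recall_plain:.2f}")
print(f"minority-class recall, weighted: {recall_weighted:.2f}")
```

The trade-off is more false positives on the majority class, which may be acceptable for an initial screening tool where missing at-risk youth is the costlier error.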
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# Assuming 'preprocessed_data' is the DataFrame containing the dataset
# Use the full dataset (no age filtering is applied here)
age_range_data = preprocessed_data
# Count positive drug abuse cases for each age
positive_cases_by_age = age_range_data[age_range_data['Youth_Drug_Abuse_Incidence'] == 1].groupby('Age').size().reset_index(name='Positive_Cases')
# Find the age with the highest frequency
max_freq_age = positive_cases_by_age.loc[positive_cases_by_age['Positive_Cases'].idxmax()]['Age']
max_freq_count = positive_cases_by_age['Positive_Cases'].max()
# Bar Plot - Positive Drug Abuse Cases for Each Age
fig = px.bar(positive_cases_by_age, x='Age', y='Positive_Cases', title='Positive Drug Abuse Cases for Each Age in New York City',
labels={'Positive_Cases': 'Count'}, category_orders={'Age': sorted(age_range_data['Age'].unique())})
# Set the color of the bar with the highest frequency to blue, and the rest to orange
fig.update_traces(marker_color=['rgba(0, 0, 255, 0.7)' if age == max_freq_age else 'rgba(255, 165, 0, 0.7)' for age in positive_cases_by_age['Age']])
# Adding Explanation with the age with the highest frequency
fig.add_annotation(
text=f'The age with the highest frequency of positive drug abuse cases is {max_freq_age} with {max_freq_count} cases.',
showarrow=False,
xref='paper', yref='paper',
x=0.5, y=-0.2,
font=dict(size=10),
)
# Show all ages without skipping any
fig.update_xaxes(tickmode='array', tickvals=positive_cases_by_age['Age'].tolist(), dtick=1)
fig.show()
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# Assuming 'preprocessed_data' is the DataFrame containing the dataset
# Count positive drug abuse cases for each gender
positive_cases_by_gender = preprocessed_data[preprocessed_data['Youth_Drug_Abuse_Incidence'] == 1].groupby('Gender').size().reset_index(name='Positive_Cases')
# Find the gender with the highest frequency
max_freq_gender = positive_cases_by_gender.loc[positive_cases_by_gender['Positive_Cases'].idxmax()]['Gender']
max_freq_count = positive_cases_by_gender['Positive_Cases'].max()
# Bar Plot - Positive Drug Abuse Cases for Each Gender
fig = px.bar(positive_cases_by_gender, x='Gender', y='Positive_Cases', title='Positive Youth Drug Abuse Cases by Gender in NYC ',
labels={'Positive_Cases': 'Count'}, color='Gender',
color_discrete_map={'Female': 'rgba(0, 0, 255, 0.7)', 'Male': 'rgba(255, 165, 0, 0.7)'})
# Adding Explanation with the gender with the highest frequency
fig.add_annotation(
text=f'The gender with the highest frequency of positive drug abuse cases is {max_freq_gender} with {max_freq_count} cases.',
showarrow=False,
xref='paper', yref='paper',
x=0.5, y=-0.2,
font=dict(size=10),
)
fig.show()
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# Count positive drug abuse cases for each borough
positive_cases_by_borough = preprocessed_data[preprocessed_data['Youth_Drug_Abuse_Incidence'] == 1].groupby('Borough').size().reset_index(name='Positive_Cases_Borough')
# Find the borough with the highest frequency
max_freq_borough = positive_cases_by_borough.loc[positive_cases_by_borough['Positive_Cases_Borough'].idxmax(), 'Borough']
max_freq_count = positive_cases_by_borough['Positive_Cases_Borough'].max()
# Create subplot
fig = make_subplots(rows=1, cols=1, subplot_titles=['Drug Abuse Incidents by Borough'])
# Subplot 1 - Pie Chart for Borough
fig.add_trace(px.pie(positive_cases_by_borough, names='Borough', values='Positive_Cases_Borough').update_traces(textinfo='label+percent').data[0])
# Set subplot title
fig.update_layout(title_text='Positive Youth Drug Abuse Cases in NYC')
# Adding Explanation with the borough with the highest frequency
fig.add_annotation(
text=f'The borough with the highest frequency of positive drug abuse cases is {max_freq_borough} with {max_freq_count} cases.',
showarrow=False,
xref='paper', yref='paper',
x=0.5, y=-0.2,
font=dict(size=10),
)
fig.show()
import plotly.express as px
# Assuming 'preprocessed_data' is the DataFrame containing the dataset
# Variable
var = 'Employment_Status'
# Exclude missing values for the variable
non_missing_data = preprocessed_data[preprocessed_data[var].notna()]
# Count positive drug abuse cases for each value in the variable
positive_cases_by_var = non_missing_data[non_missing_data['Youth_Drug_Abuse_Incidence'] == 1][var].value_counts().reset_index()
positive_cases_by_var.columns = [var, 'Positive_Cases']
if not positive_cases_by_var.empty:
    # Find the value with the highest frequency
    max_freq_value = positive_cases_by_var.loc[positive_cases_by_var['Positive_Cases'].idxmax(), var]
    max_freq_count = positive_cases_by_var['Positive_Cases'].max()
    # Bar Plot - Variable vs. Positive Drug Abuse Cases with a different color scale
    fig = px.bar(positive_cases_by_var, x=var, y='Positive_Cases',
                 title=f'{var} vs. Positive Drug Abuse Cases',
                 labels={'Positive_Cases': 'Count'},
                 color=var,  # Use the variable itself as the color scale
                 color_continuous_scale='RdYlBu')  # You can choose other color scales
    # Adding Explanation with the value with the highest frequency
    fig.add_annotation(
        text=f'The value with the highest frequency of positive drug abuse cases is {max_freq_value} with {max_freq_count} cases.',
        showarrow=False,
        xref='paper', yref='paper',
        x=0.5, y=-0.2,
        font=dict(size=10),
    )
    # Show the plot without footer annotation
    fig.show()
import plotly.express as px
# Variable
var = 'Stress_Level'
# Exclude missing values for the variable
non_missing_data = preprocessed_data[preprocessed_data[var].notna()]
# Count positive drug abuse cases for each value in the variable
positive_cases_by_var = non_missing_data[non_missing_data['Youth_Drug_Abuse_Incidence'] == 1][var].value_counts().reset_index()
positive_cases_by_var.columns = [var, 'Positive_Cases']
if not positive_cases_by_var.empty:
    # Find the value with the highest frequency
    max_freq_value = positive_cases_by_var.loc[positive_cases_by_var['Positive_Cases'].idxmax(), var]
    max_freq_count = positive_cases_by_var['Positive_Cases'].max()
    # Bar Plot - Variable vs. Positive Drug Abuse Cases
    fig = px.bar(positive_cases_by_var, x=var, y='Positive_Cases',
                 title=f'{var} vs. Positive Drug Abuse Cases',
                 labels={'Positive_Cases': 'Count'},
                 color=var,  # Colour bars by the variable itself
                 color_continuous_scale='Plasma')  # Other colour scales can be substituted
    # Add an annotation naming the value with the highest frequency
    fig.add_annotation(
        text=f'The value with the highest frequency of positive drug abuse cases is {max_freq_value} with {max_freq_count} cases.',
        showarrow=False,
        xref='paper', yref='paper',
        x=0.5, y=-0.2,
        font=dict(size=10),
    )
    # Show the plot
    fig.show()
import pandas as pd
import plotly.express as px
# Assuming 'preprocessed_data' is the DataFrame containing the dataset
# Define income bins
income_bins = [0, 20000, 40000, 60000, 80000, 100000, float('inf')]
income_labels = ['0-20k', '20-40k', '40-60k', '60-80k', '80-100k', '100k+']
# Create a new column with income bins
preprocessed_data['Income_Bin'] = pd.cut(preprocessed_data['Income'], bins=income_bins, labels=income_labels, right=False)
# Select relevant columns
income_vs_abuse = preprocessed_data[['Income_Bin', 'Youth_Drug_Abuse_Incidence']]
# Group by income bins and count the frequency of positive cases
frequency_by_income = income_vs_abuse.groupby('Income_Bin')['Youth_Drug_Abuse_Incidence'].sum().reset_index()
frequency_by_income.columns = ['Income_Bin', 'Frequency']
# Sort values by income bins for a clear scatter plot
frequency_by_income = frequency_by_income.sort_values(by='Income_Bin')
# Create a line plot with a color scale
fig = px.line(frequency_by_income, x='Income_Bin', y='Frequency', line_shape='linear', markers=True,
              title='Line Plot: Income Range vs Frequency of Positive Drug Abuse Cases',
              labels={'Income_Bin': 'Income Range', 'Frequency': 'Frequency of Positive Cases'})
# Add color scale to the plot
fig.update_traces(marker=dict(size=12,
color=frequency_by_income['Frequency'],
colorscale='Viridis',
showscale=True))
# Show the plot
fig.show()
import plotly.express as px
import pandas as pd
historical_data = pd.read_csv('historical_data.csv')
incidents_by_year = historical_data.groupby('Year')['Drug_Abuse_Positive'].sum().reset_index()
# Plotting the line graph with Plotly
fig = px.line(incidents_by_year, x='Year', y='Drug_Abuse_Positive',
title='Number of Drug Abuse Incidents in NYC Over the Years',
labels={'Year': 'Year', 'Drug_Abuse_Positive': 'Number of Incidents'},
markers=True, line_shape='linear', template='plotly_dark')
# Show the plot
fig.show()
import plotly.express as px
import pandas as pd
historical_data = pd.read_csv('historical_data.csv')
incidents_by_year = historical_data.groupby('Year')['Death'].sum().reset_index()
# Plotting the line graph with Plotly
fig = px.line(incidents_by_year, x='Year', y='Death',
title='Number of Deaths due to Drug Abuse in NYC Over the Years',
labels={'Year': 'Year', 'Death': 'Number of Deaths'},
markers=True, line_shape='linear', template='plotly_dark')
# Show the plot
fig.show()
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
# Assuming 'historical_data' is your DataFrame with 'Year' and 'Drug_Abuse_Positive' columns
time_series_data = historical_data.groupby('Year')['Drug_Abuse_Positive'].sum().reset_index()
# Split the data into training and testing sets
train_size = int(len(time_series_data) * 0.8)
train, test = time_series_data[:train_size], time_series_data[train_size:]
# Fit an ARIMA model
order = (5, 1, 1) # Example order, you may need to tune this
model = ARIMA(train['Drug_Abuse_Positive'], order=order)
fit_model = model.fit()
# Forecast future values
forecast = fit_model.forecast(steps=len(test))
# Create a Plotly figure
fig = go.Figure()
# Plot training data
fig.add_trace(go.Scatter(x=train['Year'], y=train['Drug_Abuse_Positive'], mode='lines', name='Training'))
# Plot actual test data
fig.add_trace(go.Scatter(x=test['Year'], y=test['Drug_Abuse_Positive'], mode='lines', name='Actual'))
# Plot forecasted data
fig.add_trace(go.Scatter(x=test['Year'], y=forecast, mode='lines', name='Forecast'))
# Update layout for better readability
fig.update_layout(title='Forecasting the Number of Drug Abuse Incidents',
xaxis_title='Year',
yaxis_title='Number of Incidents',
legend=dict(x=0, y=1, traceorder='normal'),
margin=dict(l=0, r=0, t=40, b=0))
# Show the plot
fig.show()
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# Fit the scaler on the training data and apply the same scaling to X_test1_imputed
scaler = StandardScaler()
X_train_imputed_scaled = scaler.fit_transform(X_train_imputed)
X_test1_scaled = scaler.transform(X_test1_imputed)
# Accounting for noise and bias: perturb the inputs to simulate real-world measurement error
noise_factor = 0.98
X_train_imputed_scaled_noisy = X_train_imputed_scaled + noise_factor * np.random.normal(size=X_train_imputed_scaled.shape)
X_test1_scaled_noisy = X_test1_scaled + noise_factor * np.random.normal(size=X_test1_scaled.shape)
bias_factor = 0.05  # Adjust the bias factor as needed
biased_column = 'Income'  # Column receiving the systematic bias
X_test1_scaled_noisy_biased = X_test1_scaled_noisy.copy()
X_test1_scaled_noisy_biased[:, X_test1.columns.get_loc(biased_column)] += bias_factor
# Make predictions on the perturbed test data
y_pred_biased = best_model.predict(X_test1_scaled_noisy_biased)
# Evaluate on the perturbed test data, applying a heuristic penalty for the injected noise and bias
accuracy_test_biased = accuracy_score(y_test, y_pred_biased) * noise_factor - bias_factor
classification_rep_test_biased = classification_report(y_test, y_pred_biased)
print("Classification Report:")
print(classification_rep_test_biased)
print(f"\n Final Accuracy on Testing Data: {accuracy_test_biased}")
Classification Report:
precision recall f1-score support
0 0.99 1.00 0.99 3955
1 0.00 0.00 0.00 45
accuracy 0.99 4000
macro avg 0.49 0.50 0.50 4000
weighted avg 0.98 0.99 0.98 4000
Final Accuracy on Testing Data: 0.918975
The output reports the final accuracy on the testing data after the injected noise and bias are accounted for. The adjusted value of roughly 0.919 suggests the model remains fairly robust under these perturbations, but the classification report tells a more important story: the positive class (45 of 4,000 test samples) receives 0.00 precision and recall, so the high overall accuracy largely reflects the dominant negative class. For a screening tool of this kind, per-class recall and F1 should be weighed alongside raw accuracy.
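Because the positive class is so rare, per-class recall and balanced accuracy are more informative than raw accuracy here. A minimal, numpy-only sketch of both (the function names and toy labels are illustrative, not part of the notebook's pipeline):

```python
import numpy as np

def per_class_recall(y_true, y_pred):
    """Return {class: recall} for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = {}
    for cls in (0, 1):
        mask = y_true == cls
        # Recall = correctly predicted members of the class / all members of the class
        recalls[cls] = float((y_pred[mask] == cls).mean()) if mask.any() else 0.0
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; 0.5 is chance level for binary labels."""
    r = per_class_recall(y_true, y_pred)
    return (r[0] + r[1]) / 2

# A classifier that always predicts the majority class scores 0.99 raw accuracy
# on this toy data, but only chance-level balanced accuracy:
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The same quantities are available in scikit-learn as `recall_score` and `balanced_accuracy_score`; the hand-rolled versions above simply make the arithmetic explicit.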
The ARIMA model is a powerful and widely used time series forecasting technique that combines autoregression, differencing, and moving averages. It is particularly effective in capturing and predicting temporal patterns in univariate time series data. The acronym ARIMA stands for AutoRegressive Integrated Moving Average, and each component reflects a different aspect of the model.
We will use the ARIMA model for the following forecasts:
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
# Assuming 'historical_data' is your DataFrame with 'Year' and 'Drug_Abuse_Positive' columns
time_series_data_abuse = historical_data.groupby('Year')['Drug_Abuse_Positive'].sum().reset_index()
# Extend the time period
future_years_abuse = pd.DataFrame({'Year': range(2024, 2031)})
# Fit an ARIMA model on the entire dataset for the 'Drug_Abuse_Positive' variable
order_abuse = (5, 1, 1) # Example order, you may need to tune this
model_abuse = ARIMA(time_series_data_abuse['Drug_Abuse_Positive'], order=order_abuse)
fit_model_abuse = model_abuse.fit()
# Forecast future values for 'Drug_Abuse_Positive'
forecast_future_abuse = fit_model_abuse.forecast(steps=len(future_years_abuse))
# Choose a different color palette
colors_abuse = {'Historical': '#3498db', 'Fitted': '#2ecc71', 'Forecast (Future)': '#e67e22'}
# Calculate metrics for 'Drug_Abuse_Positive'
mse_abuse = mean_squared_error(time_series_data_abuse['Drug_Abuse_Positive'], fit_model_abuse.fittedvalues)
rmse_abuse = sqrt(mse_abuse)
mae_abuse = mean_absolute_error(time_series_data_abuse['Drug_Abuse_Positive'], fit_model_abuse.fittedvalues)
# Create a Plotly figure for 'Drug_Abuse_Positive'
fig_abuse = go.Figure()
# Plot historical data for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=time_series_data_abuse['Year'], y=time_series_data_abuse['Drug_Abuse_Positive'],
mode='lines+markers', name='Historical', line=dict(color=colors_abuse['Historical'], width=2)))
# Plot fitted data for the historical period for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=time_series_data_abuse['Year'], y=fit_model_abuse.fittedvalues,
mode='lines', name='Fitted', line=dict(color=colors_abuse['Fitted'], width=2)))
# Plot forecasted data for the future period for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=future_years_abuse['Year'], y=forecast_future_abuse,
mode='lines', name='Forecast (Future)', line=dict(color=colors_abuse['Forecast (Future)'], width=2, dash='dash')))
# Update layout for 'Drug_Abuse_Positive'
fig_abuse.update_layout(title='Forecasting Drug Abuse Incidents Amidst the Youth of NYC till 2030',
xaxis_title='Year',
yaxis_title='Number of Incidents',
legend=dict(x=0, y=1, traceorder='normal'),
margin=dict(l=0, r=0, t=40, b=0),
plot_bgcolor='rgba(255,255,255,0)',
paper_bgcolor='rgba(255,255,255,0)',
font=dict(color='black'),
hovermode='x unified')
# Show the plot for 'Drug_Abuse_Positive'
fig_abuse.show()
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
# Assuming 'historical_data' is your DataFrame with 'Year' and 'Drug_Abuse_Positive' columns
time_series_data_abuse = historical_data.groupby('Year')['Drug_Abuse_Positive'].sum().reset_index()
# Extend the time period
future_years_abuse = pd.DataFrame({'Year': range(2024, 2051)})
# Fit an ARIMA model on the entire dataset for the 'Drug_Abuse_Positive' variable
order_abuse = (5, 1, 1) # Example order, you may need to tune this
model_abuse = ARIMA(time_series_data_abuse['Drug_Abuse_Positive'], order=order_abuse)
fit_model_abuse = model_abuse.fit()
# Forecast future values for 'Drug_Abuse_Positive'
forecast_future_abuse = fit_model_abuse.forecast(steps=len(future_years_abuse))
# Choose a different color palette
colors_abuse = {'Historical': '#3498db', 'Fitted': '#2ecc71', 'Forecast (Future)': '#e67e22'}
# Calculate metrics for 'Drug_Abuse_Positive'
mse_abuse = mean_squared_error(time_series_data_abuse['Drug_Abuse_Positive'], fit_model_abuse.fittedvalues)
rmse_abuse = sqrt(mse_abuse)
mae_abuse = mean_absolute_error(time_series_data_abuse['Drug_Abuse_Positive'], fit_model_abuse.fittedvalues)
# Print metrics for 'Drug_Abuse_Positive'
print(f"MSE: {mse_abuse:.2f}, RMSE: {rmse_abuse:.2f}, MAE: {mae_abuse:.2f}")
# Create a Plotly figure for 'Drug_Abuse_Positive'
fig_abuse = go.Figure()
# Plot historical data for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=time_series_data_abuse['Year'], y=time_series_data_abuse['Drug_Abuse_Positive'],
mode='lines+markers', name='Historical', line=dict(color=colors_abuse['Historical'], width=2)))
# Plot fitted data for the historical period for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=time_series_data_abuse['Year'], y=fit_model_abuse.fittedvalues,
mode='lines', name='Fitted', line=dict(color=colors_abuse['Fitted'], width=2)))
# Plot forecasted data for the future period for 'Drug_Abuse_Positive'
fig_abuse.add_trace(go.Scatter(x=future_years_abuse['Year'], y=forecast_future_abuse,
mode='lines', name='Forecast (Future)', line=dict(color=colors_abuse['Forecast (Future)'], width=2, dash='dash')))
# Update layout for 'Drug_Abuse_Positive'
fig_abuse.update_layout(title='Forecasting Drug Abuse Incidents Amidst the Youth of NYC till 2050',
xaxis_title='Year',
yaxis_title='Number of Incidents',
legend=dict(x=0, y=1, traceorder='normal'),
margin=dict(l=0, r=0, t=40, b=0),
plot_bgcolor='rgba(255,255,255,0)',
paper_bgcolor='rgba(255,255,255,0)',
font=dict(color='black'),
hovermode='x unified')
# Show the plot for 'Drug_Abuse_Positive'
fig_abuse.show()
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
# Assuming 'historical_data' is your DataFrame with 'Year', 'Drug_Abuse_Positive', and 'Death' columns
time_series_data_death = historical_data.groupby('Year')['Death'].sum().reset_index()
# Extend the time period
future_years_death = pd.DataFrame({'Year': range(2024, 2031)})
# Fit an ARIMA model on the entire dataset for the 'Death' variable
order_death = (5, 1, 1) # Example order, you may need to tune this
model_death = ARIMA(time_series_data_death['Death'], order=order_death)
fit_model_death = model_death.fit()
# Forecast future values for 'Death'
forecast_future_death = fit_model_death.forecast(steps=len(future_years_death))
# Choose a different color palette
colors_death = {'Historical': '#3498db', 'Fitted': '#2ecc71', 'Forecast (Future)': '#e67e22'}
# Calculate metrics for 'Death'
mse_death = mean_squared_error(time_series_data_death['Death'], fit_model_death.fittedvalues)
rmse_death = sqrt(mse_death)
mae_death = mean_absolute_error(time_series_data_death['Death'], fit_model_death.fittedvalues)
# Create a Plotly figure for 'Death'
fig_death = go.Figure()
# Plot historical data for 'Death'
fig_death.add_trace(go.Scatter(x=time_series_data_death['Year'], y=time_series_data_death['Death'],
mode='lines+markers', name='Historical', line=dict(color=colors_death['Historical'], width=2)))
# Plot fitted data for the historical period for 'Death'
fig_death.add_trace(go.Scatter(x=time_series_data_death['Year'], y=fit_model_death.fittedvalues,
mode='lines', name='Fitted', line=dict(color=colors_death['Fitted'], width=2)))
# Plot forecasted data for the future period for 'Death'
fig_death.add_trace(go.Scatter(x=future_years_death['Year'], y=forecast_future_death,
mode='lines', name='Forecast (Future)', line=dict(color=colors_death['Forecast (Future)'], width=2, dash='dash')))
# Update layout for 'Death'
fig_death.update_layout(title='Steady Increase in Deaths due to Drug Abuse amidst the Youth of NYC till 2030',
xaxis_title='Year',
yaxis_title='Number of Deaths',
legend=dict(x=0, y=1, traceorder='normal'),
margin=dict(l=0, r=0, t=40, b=0),
plot_bgcolor='rgba(255,255,255,0)',
paper_bgcolor='rgba(255,255,255,0)',
font=dict(color='black'),
hovermode='x unified')
# Show the plot for 'Death'
fig_death.show()
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
# Assuming 'historical_data' is your DataFrame with 'Year', 'Drug_Abuse_Positive', and 'Death' columns
time_series_data_death = historical_data.groupby('Year')['Death'].sum().reset_index()
# Extend the time period
future_years_death = pd.DataFrame({'Year': range(2024, 2051)})
# Fit an ARIMA model on the entire dataset for the 'Death' variable
order_death = (5, 1, 1) # Example order, you may need to tune this
model_death = ARIMA(time_series_data_death['Death'], order=order_death)
fit_model_death = model_death.fit()
# Forecast future values for 'Death'
forecast_future_death = fit_model_death.forecast(steps=len(future_years_death))
# Choose a different color palette
colors_death = {'Historical': '#3498db', 'Fitted': '#2ecc71', 'Forecast (Future)': '#e67e22'}
# Calculate metrics for 'Death'
mse_death = mean_squared_error(time_series_data_death['Death'], fit_model_death.fittedvalues)
rmse_death = sqrt(mse_death)
mae_death = mean_absolute_error(time_series_data_death['Death'], fit_model_death.fittedvalues)
# Create a Plotly figure for 'Death'
fig_death = go.Figure()
# Plot historical data for 'Death'
fig_death.add_trace(go.Scatter(x=time_series_data_death['Year'], y=time_series_data_death['Death'],
mode='lines+markers', name='Historical', line=dict(color=colors_death['Historical'], width=2)))
# Plot fitted data for the historical period for 'Death'
fig_death.add_trace(go.Scatter(x=time_series_data_death['Year'], y=fit_model_death.fittedvalues,
mode='lines', name='Fitted', line=dict(color=colors_death['Fitted'], width=2)))
# Plot forecasted data for the future period for 'Death'
fig_death.add_trace(go.Scatter(x=future_years_death['Year'], y=forecast_future_death,
mode='lines', name='Forecast (Future)', line=dict(color=colors_death['Forecast (Future)'], width=2, dash='dash')))
# Update layout for 'Death'
fig_death.update_layout(title='Steady Increase in Deaths due to Drug Abuse amidst the Youth of NYC till 2050',
xaxis_title='Year',
yaxis_title='Number of Deaths',
legend=dict(x=0, y=1, traceorder='normal'),
margin=dict(l=0, r=0, t=40, b=0),
plot_bgcolor='rgba(255,255,255,0)',
paper_bgcolor='rgba(255,255,255,0)',
font=dict(color='black'),
hovermode='x unified')
# Show the plot for 'Death'
fig_death.show()
Introduction:
Clustering analysis is a powerful technique used in data analysis to group similar items or observations based on their inherent characteristics. In the context of understanding drug abuse, clustering helps identify patterns and similarities within social preferences, mental health attributes, and peer influences. By grouping individuals with similar traits, we can gain insights into the primary contributors to drug abuse.
Objective: The primary objective of clustering analysis in the context of drug abuse is to uncover inherent patterns and similarities among individuals. By doing so, we aim to identify groups of individuals who share common social, mental health, and peer influence characteristics. These identified groups can then be analyzed to understand the factors that may contribute to drug abuse within each cluster.
Key Components:
Social Preferences:
Mental Health Attributes:
Peer Influences:
Elaboration:
Identifying Similar Profiles:
Interpreting Cluster Characteristics:
Understanding Correlations:
Targeted Interventions:
Benefits of Clustering Analysis:
Personalized Insights: Clustering provides personalized insights into subgroups of the population, allowing for a nuanced understanding of drug abuse factors.
Data-Driven Decision-Making: The results of clustering analysis empower decision-makers with data-driven insights, facilitating more informed strategies for drug abuse prevention and intervention.
Resource Allocation: By identifying high-risk clusters, resources can be allocated more efficiently to address the unique challenges faced by each subgroup.
Conclusion: Clustering analysis serves as a valuable tool in uncovering hidden patterns and understanding the complex interplay of social, mental health, and peer influence factors contributing to drug abuse. The insights gained from clustering contribute to more effective and targeted approaches in combating substance abuse within diverse populations.
A Plotly Express line plot is created to visualize the inertia values against varying K values (the corresponding distortion values are printed alongside). The 'elbow' point in the plot indicates the optimal number of clusters; beyond it, both distortion and inertia decrease only gradually.
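Reading the elbow off the plot by eye can also be automated. One simple heuristic picks the K after which the relative drop in inertia first falls below a threshold; a minimal sketch (the function name and the 0.3 threshold are illustrative choices, not part of the notebook's pipeline):

```python
def elbow_k(inertias, threshold=0.3):
    """Return the K (1-based) after which the relative drop in inertia
    first falls below `threshold` -- a simple elbow heuristic.
    inertias[i] is assumed to correspond to K = i + 1."""
    for i in range(1, len(inertias)):
        drop = (inertias[i - 1] - inertias[i]) / inertias[i - 1]
        if drop < threshold:
            return i  # drop from K=i to K=i+1 is small, so the elbow is at K=i
    return len(inertias)

# Inertia typically falls steeply and then flattens out, as in the run below
inertias = [33800, 8340, 3520, 776, 60, 46, 39, 34]
print(elbow_k(inertias))  # 5
```

More robust alternatives (e.g. the "kneedle" algorithm or silhouette scores) exist, but this captures the visual rule of thumb the elbow plot encodes.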
import pandas as pd
import geopandas as gpd
import plotly.graph_objects as go
from shapely.geometry import Point
# Load the data
dataset = pd.read_csv('drug_abuse-1.csv')
# Define the coordinate reference system (CRS); the {'init': ...} dict form is deprecated in modern geopandas
crs = 'EPSG:4326'
# Create a GeoDataFrame with geometry column
geometry = [Point(xy) for xy in zip(dataset['Longitude'], dataset['Latitude'])]
purified_dataset = gpd.GeoDataFrame(dataset, crs=crs, geometry=geometry)
# Read the NYC map without transforming CRS
NYC_map = gpd.read_file('geo_export_7153a758-59c2-4ed3-880f-e2f2fd4330cc.shp')
# Set the CRS for NYC_map
NYC_map.crs = crs
# Transform to EPSG:4326
NYC_map_transformed = NYC_map.to_crs(epsg=4326)
# Create a scatter plot using Plotly
fig = go.Figure()
# Add a scatter mapbox trace for the GeoDataFrame
fig.add_trace(
go.Scattermapbox(
lat=purified_dataset.geometry.y,
lon=purified_dataset.geometry.x,
mode='markers',
marker=dict(
size=8,
color='orange', # Set your desired marker color
),
text=purified_dataset['STATISTICAL_DRUG_ABUSE_FLAG'], # Tooltip text
hoverinfo='text',
)
)
# Update layout to use a Mapbox map
fig.update_layout(
title_text='Active Drug Abuse Cases Recorded in NYC',
mapbox=dict(
style="carto-positron",
zoom=10,
center=dict(lat=40.7128, lon=-74.0060), # Center of NYC
),
margin=dict(l=0, r=0, t=0, b=0),
)
print('Youth Drug Abuse Cases in NYC over the last three years')
fig.show()
Youth Drug Abuse Cases in NYC over the last three years
import pandas as pd
import plotly.express as px
# Assuming you have 'purified_dataset' loaded with your data
# Get the counts of incidents in each borough
borough_counts = purified_dataset['BORO'].value_counts()
# Create a DataFrame for plotting
borough_df = pd.DataFrame({'Borough': borough_counts.index, 'Number of Incidents': borough_counts.values})
# Plot using Plotly Express
fig = px.bar(borough_df, x='Borough', y='Number of Incidents',
title='Number of Youth Drug Abuse Incidents in Each Borough from 2006 to 2020',
labels={'Number of Incidents': 'Number of incidents'},
color='Borough',
color_discrete_sequence=px.colors.qualitative.Set1) # Set your desired color sequence
# Add a horizontal line for the average number of incidents
fig.add_shape(type='line',
x0=-0.5, x1=len(borough_df['Borough']) - 0.5,
y0=borough_df['Number of Incidents'].mean(), y1=borough_df['Number of Incidents'].mean(),
line=dict(color='red', width=2),
name='Average Number of Incidents')
# Keep an alias used by the later cells
purified_data = purified_dataset
# Show the plot
fig.show()
import pandas as pd
import geopandas as gpd
import plotly.express as px
from shapely.geometry import Point
# Assuming you have 'purified_data' loaded with your data
# Convert 'OCCUR_DATE' to datetime format
purified_data['OCCUR_DATE'] = pd.to_datetime(purified_data['OCCUR_DATE'], format='%m/%d/%Y')
# Create a new column 'Year'
purified_data['Year'] = purified_data['OCCUR_DATE'].dt.year
def yearWisePlots(df, year):
    df_year = df[df['Year'] == year]
    # Create a scatter mapbox plot using Plotly Express
    fig = px.scatter_mapbox(
        df_year,
        lat=df_year.geometry.y,
        lon=df_year.geometry.x,
        color='Year',  # Use 'Year' for color
        color_continuous_scale='Viridis',  # Choose a color scale
        opacity=0.5,
        zoom=10,
        center=dict(lat=40.7128, lon=-74.0060),
    )
    # Customize the map layout
    fig.update_layout(mapbox_style="carto-positron", margin=dict(l=0, r=0, t=0, b=0))
    # Show only the map and points without details
    fig.update_layout(
        title='',
        coloraxis_colorbar=dict(title=''),  # Hide color scale legend title
        coloraxis_colorbar_thickness=0,  # Remove color scale bar
        showlegend=False,  # Hide legend
        coloraxis_colorbar_ticks='',  # Remove color scale bar ticks
        coloraxis_colorbar_tickvals=[],
        coloraxis=dict(showscale=False),  # Remove the color scale entirely
    )
    # Show the map
    print(f'NY Youth Abuse Cases for NYC for the year: {year}')
    fig.show()
# Call the function for each year
for year in range(2006, 2021):
    yearWisePlots(purified_data, year)
NY Youth Abuse Cases for NYC for the year: 2006
NY Youth Abuse Cases for NYC for the year: 2007
NY Youth Abuse Cases for NYC for the year: 2008
NY Youth Abuse Cases for NYC for the year: 2009
NY Youth Abuse Cases for NYC for the year: 2010
NY Youth Abuse Cases for NYC for the year: 2011
NY Youth Abuse Cases for NYC for the year: 2012
NY Youth Abuse Cases for NYC for the year: 2013
NY Youth Abuse Cases for NYC for the year: 2014
NY Youth Abuse Cases for NYC for the year: 2015
NY Youth Abuse Cases for NYC for the year: 2016
NY Youth Abuse Cases for NYC for the year: 2017
NY Youth Abuse Cases for NYC for the year: 2018
NY Youth Abuse Cases for NYC for the year: 2019
NY Youth Abuse Cases for NYC for the year: 2020
This code initiates a comprehensive analysis of drug abuse patterns, utilizing a clustering approach to uncover distinct groups within the dataset. First, it preprocesses the data by transforming the 'OCCUR_DATE' column into a datetime format and extracting the 'Year' for temporal insights. The 'BORO' column is standardized by converting it to lowercase and subsequently label encoding it. Next, numeric columns are selected, and missing values are imputed with the mean to ensure a complete dataset. The KMeans clustering algorithm is then applied, with K set to 4, to categorize data points based on shared features related to drug effects awareness and peer influence.
Moving forward, the code engages in a spatial visualization process to represent the clustered data on a map. Each cluster is assigned a unique color for visual distinction, offering insights into the geographical distribution of drug abuse patterns. The map is further enriched with annotations for each cluster centroid, providing information on characteristics such as education level and awareness. This approach not only aids in identifying localized trends but also facilitates targeted interventions by understanding the unique characteristics associated with each cluster. The resulting visualization serves as a powerful tool for policymakers and healthcare professionals seeking to address drug abuse issues effectively.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist  # used for the distortion metric below
import numpy as np
# Assuming you have 'purified_dataset' loaded with your data
clustering_data = purified_dataset.copy()
# Convert 'BORO' to lowercase
clustering_data['BORO'] = clustering_data['BORO'].str.lower()
# Encode 'BORO' using label encoding
clustering_data['BORO'] = clustering_data['BORO'].astype('category').cat.codes
# Select columns for clustering
clustering_data = clustering_data[['Latitude', 'Longitude', 'BORO']]
# Drop rows with missing values
clustering_data = clustering_data.dropna()
# Convert to numpy array
clustering_data_elbow = clustering_data.to_numpy()
# Perform KMeans clustering
distorts = []
inertias = []
KRange = 20
for k in range(1, KRange):
    kMeansClusteringModel = KMeans(n_clusters=k, n_init=10, random_state=0)
    kMeansClusteringModel.fit(clustering_data_elbow)
    # Distortion: mean Euclidean distance of each point to its nearest cluster centre
    distorts.append(sum(np.min(cdist(clustering_data_elbow, kMeansClusteringModel.cluster_centers_, 'euclidean'), axis=1)) / clustering_data_elbow.shape[0])
    inertias.append(kMeansClusteringModel.inertia_)
print("distorts ", distorts)
print("inertias ",inertias)
# Create a Plotly Express line plot
fig = px.line(x=range(1, KRange), y=inertias, labels={'x': 'Various K values', 'y': 'Inertia values for the respective K values'},
title='The Elbow Method using Inertia values', markers=True)
# Customize the layout
fig.update_layout(
title_text='Optimal Number of Clusters',
xaxis_title='Number of Clusters (K)',
yaxis_title='Inertia',
width=800,
height=500,
)
# Show the plot
fig.show()
distorts [0.9019023279147855, 0.5232458528656343, 0.2618606619730464, 0.07980377481830955, 0.03820658082252529, 0.03489700852117171, 0.03205799757896832, 0.029119530234804982, 0.027325704891574844, 0.02525568780974544, 0.023561201369010354, 0.02247029255832709, 0.02188106202285861, 0.02040085514009708, 0.019566004142997195, 0.01906672351708948, 0.0177981585748418, 0.01732019152767647, 0.01674299116295236]
inertias [33803.8633829286, 8344.706170167507, 3517.0660412921807, 776.1461997826084, 60.36845048082881, 45.88876556742616, 38.68300218729637, 34.021937348969885, 29.389771147688982, 24.77249908047648, 20.43873581655079, 18.592086511378277, 17.237220192040823, 15.692396427908353, 14.560877840256282, 13.393553839080337, 12.231792401660892, 11.210950012372441, 10.35197777346865]
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
# Assuming purified_data is defined
purified_data['OCCUR_DATE'] = pd.to_datetime(purified_data['OCCUR_DATE'], format='%m/%d/%Y')
purified_data['Year'] = purified_data['OCCUR_DATE'].dt.year
clustering_data = purified_data.iloc[:, 3:]
clustering_data['BORO'] = clustering_data['BORO'].str.lower()
clustering_data['BORO'] = clustering_data['BORO'].astype('category').cat.codes
# Drop non-numeric columns
df_numeric = clustering_data.select_dtypes(include=['number'])
# Replace NaN values with the mean of each column
imputer = SimpleImputer(strategy='mean')
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns)
def kMeansClustering(K, df):
    try:
        model = KMeans(n_clusters=K, random_state=0).fit(df)
    except AttributeError:
        # Some sklearn/threadpoolctl version combinations raise AttributeError;
        # retry with the BLAS thread pool limited to a single thread
        import threadpoolctl
        with threadpoolctl.threadpool_limits(limits=1, user_api="blas"):
            model = KMeans(n_clusters=K, random_state=0).fit(df)
    labels = pd.Series(model.labels_, name='label')
    drug_abuse_clusters = df.join(labels.to_frame())
    # 'Longitude' and 'Latitude' are the spatial columns
    drug_abuse_cluster_centroids = drug_abuse_clusters.groupby('label').mean()
    gdf_centroids = gpd.GeoDataFrame(geometry=gpd.points_from_xy(drug_abuse_cluster_centroids['Longitude'], drug_abuse_cluster_centroids['Latitude']))
    gdf = gpd.GeoDataFrame(drug_abuse_clusters, geometry=gpd.points_from_xy(drug_abuse_clusters['Longitude'], drug_abuse_clusters['Latitude']))
    fig, ax = plt.subplots(figsize=(10, 10))
    # NYC_map is loaded earlier in the notebook
    NYC_map.plot(ax=ax, edgecolor='black', color='white')
    gdf.plot(markersize=8, alpha=0.8, ax=ax, column='label', cmap=plt.cm.get_cmap('coolwarm', K))
    # Add annotations for each cluster with custom labels
    cluster_labels = ['Uneducated', 'Low Awareness', 'Moderate Awareness', 'Educated']
    for cluster_label, (_, centroid), custom_label in zip(drug_abuse_cluster_centroids.index, drug_abuse_cluster_centroids.iterrows(), cluster_labels):
        plt.annotate(
            f'Cluster {cluster_label}\n{custom_label}',
            xy=(centroid['Longitude'], centroid['Latitude']),
            xytext=(centroid['Longitude'] + 0.01, centroid['Latitude'] + 0.01),
            ha='left',
            va='bottom',
            bbox=dict(boxstyle='round', alpha=0.2, facecolor='orange'),
            arrowprops=dict(facecolor='black', arrowstyle='wedge,tail_width=0.7', alpha=0.2),
            fontsize=8,
        )
    ax.set_title("Clustering Drug Abuse based on Awareness of Drug Effects and Peer Influence by using KMeans Algorithm")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    # Remove the colour sidebar legend if one was created
    if ax.get_legend():
        ax.get_legend().remove()
    gdf_centroids.plot(ax=ax, color='red', alpha=1, marker='*', markersize=60)
    plt.show()
# Continue with the mapping
kMeansClustering(4, df_numeric_imputed)
This code snippet conducts a clustering analysis using the Agglomerative Clustering algorithm to discern patterns in drug abuse based on mental health diagnoses. During preprocessing, 'OCCUR_DATE' is converted to a datetime format and the 'Year' is extracted for temporal analysis; the 'BORO' column is standardized via lowercase conversion and label encoding; non-numeric columns are excluded; and missing values are imputed with column means. Agglomerative clustering is then applied with K set to 3, grouping data points by shared characteristics related to mental health diagnoses.
Following the clustering, the code generates a spatial visualization of the clustered data on a map. Each cluster is assigned a distinct color, revealing the geographical distribution of drug abuse patterns associated with mental health. Annotations at each cluster centroid summarize mental health cases and stress levels. This visualization helps identify localized trends in mental health issues, supporting targeted interventions by policymakers and healthcare professionals to address drug abuse effectively.
from sklearn.cluster import AgglomerativeClustering
from sklearn.impute import SimpleImputer
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
# Assuming purified_data is defined
purified_data['OCCUR_DATE'] = pd.to_datetime(purified_data['OCCUR_DATE'], format='%m/%d/%Y')
purified_data['Year'] = purified_data['OCCUR_DATE'].dt.year
clustering_data = purified_data.iloc[:, 3:]
clustering_data['BORO'] = clustering_data['BORO'].str.lower()
clustering_data['BORO'] = clustering_data['BORO'].astype('category').cat.codes
# Drop non-numeric columns
df_numeric = clustering_data.select_dtypes(include=['number'])
# Replace NaN values with the mean of each column
imputer = SimpleImputer(strategy='mean')
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns)
def agglomerativeClustering(K, df):
model = AgglomerativeClustering(n_clusters=K).fit(df)
labels = pd.Series(model.labels_, name='label')
drug_abuse_clusters = df.join(labels.to_frame())
# Assuming 'Longitude' and 'Latitude' are your spatial columns
drug_abuse_cluster_centroids = drug_abuse_clusters.groupby('label').mean()
gdf_centroids = gpd.GeoDataFrame(geometry=gpd.points_from_xy(drug_abuse_cluster_centroids['Longitude'], drug_abuse_cluster_centroids['Latitude']))
gdf = gpd.GeoDataFrame(drug_abuse_clusters, geometry=gpd.points_from_xy(drug_abuse_clusters['Longitude'], drug_abuse_clusters['Latitude']))
fig, ax = plt.subplots(figsize=(10, 10))
# Assuming NYC_map is defined in your code
NYC_map.plot(ax=ax, edgecolor='black', color='white')
# Assigning colors based on cluster characteristics
colors = ['lightgreen', 'lightblue', 'lightcoral']
gdf.plot(markersize=8, alpha=0.8, ax=ax, column='label', cmap=plt.get_cmap('coolwarm', K))
# Add annotations for each cluster with custom labels
cluster_labels = ['>=2 Mental Health cases & High Stress Levels', '1 Mental Health Case & Positive Stress Level Increase', '0 Mental Health Cases & Normal Stress']
for cluster_label, centroid, custom_label in zip(drug_abuse_cluster_centroids.index, drug_abuse_cluster_centroids.iterrows(), cluster_labels):
plt.annotate(
f'Cluster {cluster_label}\n{custom_label}',
xy=(centroid[1]['Longitude'], centroid[1]['Latitude']),
xytext=(centroid[1]['Longitude'] + 0.01, centroid[1]['Latitude'] + 0.01),
ha='left',
va='bottom',
bbox=dict(boxstyle='round', alpha=0.2, facecolor='orange'),
arrowprops=dict(facecolor='black', arrowstyle='wedge,tail_width=0.7', alpha=0.2),
fontsize=8,
)
ax.set_title("Clustering Drug Abuse Based on Mental Health Diagnosis Using the Agglomerative Clustering Algorithm")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
# Check if there is a legend before attempting to remove it
if ax.get_legend():
ax.get_legend().remove() # Remove the color sidebar
gdf_centroids.plot(ax=ax, color='red', alpha=1, marker='*', markersize=60)
plt.show()
# Continue with the mapping
agglomerativeClustering(3, df_numeric_imputed)
import pandas as pd
import plotly.express as px
# Assuming purified_data is defined
purified_data['OCCUR_DATE'] = pd.to_datetime(purified_data['OCCUR_DATE'], format='%Y-%m-%d')
# Extract month and create a new column
purified_data['Month'] = purified_data['OCCUR_DATE'].dt.month
# Create a new column 'Season' based on the month
season_dict = {1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer',
7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Winter'}
purified_data['Season'] = purified_data['Month'].map(season_dict)
# Group by 'Season' and divide each season's incident count by the number of
# distinct days observed in that season, giving a true per-day average
days_per_season = purified_data.groupby('Season')['OCCUR_DATE'].apply(lambda s: s.dt.date.nunique())
average_incidents = purified_data.groupby('Season')['OCCUR_DATE'].count() / days_per_season
# Create a Plotly bar chart
fig = px.bar(
x=average_incidents.index,
y=average_incidents.values,
color=average_incidents.index,
labels={'x': 'Season', 'y': 'Average Incidents per Day'},
title='Average Drug Abuse Incidents per Day Based on Seasons',
)
# Update the layout for better aesthetics
fig.update_layout(
xaxis=dict(tickmode='array', tickvals=[0, 1, 2, 3], ticktext=average_incidents.index),
showlegend=False, # Hide legend
yaxis_title='Average Incidents per Day',
xaxis_title='Season',
title_x=0.5, # Center the title
)
# Show the plot
fig.show()
import pandas as pd
import plotly.express as px
# Assuming purified_data is defined
purified_data['OCCUR_DATE'] = pd.to_datetime(purified_data['OCCUR_DATE'], format='%Y-%m-%d')
# Extract month and create a new column
purified_data['Month'] = purified_data['OCCUR_DATE'].dt.month
# Create a new column 'Season' based on the month
season_dict = {1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer',
7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Winter'}
purified_data['Season'] = purified_data['Month'].map(season_dict)
# Extract hour from 'OCCUR_TIME'
purified_data['Hour'] = pd.to_datetime(purified_data['OCCUR_TIME'], format='%H:%M:%S').dt.hour
# Count incidents per (Season, Hour), then divide by the number of distinct days in
# each season so the 'Average Incidents' column is a genuine per-day average
days_per_season = purified_data.groupby('Season')['OCCUR_DATE'].apply(lambda s: s.dt.date.nunique())
season_average = purified_data.groupby(['Season', 'Hour']).size().reset_index(name='Average Incidents')
season_average['Average Incidents'] = season_average['Average Incidents'] / season_average['Season'].map(days_per_season)
# Create a Plotly bar chart
fig = px.bar(
season_average,
x='Hour',
y='Average Incidents',
color='Season',
labels={'x': 'Hour of the Day', 'y': 'Average Incidents'},
title='Average Drug Abuse Incidents in a Day (Time vs Incident) - All Seasons',
height=600,
width=800,
)
# Show the plot
fig.show()
from sklearn.metrics import silhouette_score
def calculate_silhouette_score(df, max_clusters):
silhouette_scores = []
for k in range(2, max_clusters + 1):
model = AgglomerativeClustering(n_clusters=k).fit(df)
labels = model.labels_
silhouette_avg = silhouette_score(df, labels)
silhouette_scores.append(silhouette_avg)
print(f"Silhouette Score for {k} clusters: {silhouette_avg}")
return silhouette_scores
# Assuming df_numeric_imputed is your numeric data after imputation
max_clusters_to_try = 10 # You can adjust this based on your preferences
silhouette_scores = calculate_silhouette_score(df_numeric_imputed, max_clusters_to_try)
Silhouette Score for 2 clusters: 0.5848675617965438
Silhouette Score for 3 clusters: 0.5920015806304297
Silhouette Score for 4 clusters: 0.601531775484761
Silhouette Score for 5 clusters: 0.4754248510141635
Silhouette Score for 6 clusters: 0.46393899828758167
Silhouette Score for 7 clusters: 0.42078681824838626
Silhouette Score for 8 clusters: 0.4318599269153144
Silhouette Score for 9 clusters: 0.4068245489481486
Silhouette Score for 10 clusters: 0.4146328667804528
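The scores above peak at 4 clusters, so a small helper can pick the best K automatically rather than reading the printout by eye. This is a minimal sketch: it assumes `silhouette_scores` is the list returned by `calculate_silhouette_score`, whose first entry corresponds to k=2.

```python
def best_k(silhouette_scores, k_start=2):
    # Index of the highest silhouette score, shifted by the starting cluster count
    best_index = max(range(len(silhouette_scores)), key=silhouette_scores.__getitem__)
    return k_start + best_index

# Scores from the run above (rounded); the maximum is at k=4
scores = [0.5849, 0.5920, 0.6015, 0.4754, 0.4639, 0.4208, 0.4319, 0.4068, 0.4146]
print(best_k(scores))  # 4
```

With the full list returned by `calculate_silhouette_score`, `best_k(silhouette_scores)` would select the cluster count to pass to `agglomerativeClustering`.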